Skip to content

Conversation

@srcarroll
Copy link
Contributor

@srcarroll srcarroll commented Jun 19, 2025

This patch modifies FuseElementwiseOps to support fusing named ops, such as linalg.map and linalg.elementwise, which are always elementwise. The fundamental logic remains the same and, for the most part, we need only to change GenericOp to LinalgOp.

@srcarroll srcarroll changed the title Generalize fusion [mlir][linalg] Elementwise fusion for linalg.map op Jun 19, 2025
@srcarroll srcarroll changed the title [mlir][linalg] Elementwise fusion for linalg.map op [mlir][linalg] Elementwise fusion for any LinalgOp Jun 19, 2025
@github-actions
Copy link

github-actions bot commented Jun 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@srcarroll
Copy link
Contributor Author

srcarroll commented Jun 19, 2025

The current changes in this PR will fuse linalg.map ops, but produce a linalg.generic even when it is technically possible to generate another linalg.map. I dont know enough to modify the code in a sane way to get linalg.map outputs (when appropriate) so if anyone has any ideas, i would love to hear. I dont think it's important enough to go crazy with, but would be nice.

@srcarroll
Copy link
Contributor Author

srcarroll commented Jun 19, 2025

Although i didn't expect differences between implementing for generic vs map, but i am seeing a failure in a case with linalg.map that passes for the equivalent linalg.generic version.

map version

func.func @main() -> (tensor<1xi64>, tensor<1xi64>) {
    %cst_16 = arith.constant dense<0> : tensor<1xi64>
    %cst_17 = arith.constant dense<1> : tensor<1xi64>
    %cst_3 = arith.constant dense<26> : tensor<1xi64>
    %c0_i64 = arith.constant 0 : i64
    %c1_i64 = arith.constant 1 : i64
    %9 = tensor.empty() : tensor<1xi1>
    %mapped_25 = linalg.map ins(%cst_17 : tensor<1xi64>) outs(%9 : tensor<1xi1>)
      (%in: i64) {
        %255 = arith.cmpi slt, %in, %c0_i64 : i64
        linalg.yield %255 : i1
      }
    %11 = tensor.empty() : tensor<1xi64>
    %mapped_27 = linalg.map ins(%cst_17 : tensor<1xi64>) outs(%11 : tensor<1xi64>)
      (%in: i64) {
        %255 = arith.subi %c0_i64, %in : i64
        linalg.yield %255 : i64
      }
    %mapped_28 = linalg.map ins(%cst_3 : tensor<1xi64>) outs(%11 : tensor<1xi64>)
      (%in: i64) {
        %255 = arith.addi %in, %c1_i64 : i64
        linalg.yield %255 : i64
      }
    %mapped_29 = linalg.map { arith.select } ins(%mapped_25, %mapped_28, %cst_16 : tensor<1xi1>, tensor<1xi64>, tensor<1xi64>) outs(%11 : tensor<1xi64>)
    %mapped_30 = linalg.map { arith.select } ins(%mapped_25, %mapped_27, %cst_17 : tensor<1xi1>, tensor<1xi64>, tensor<1xi64>) outs(%11 : tensor<1xi64>)
    return %mapped_29, %mapped_30 : tensor<1xi64>, tensor<1xi64>
}

fails with

failed-fusion.mlir:25:18: error: 'linalg.generic' op expected type of bb argument #0 ('i64') to match element or self type of the corresponding operand ('i1')
    %mapped_30 = linalg.map { arith.select } ins(%mapped_25, %mapped_27, %cst_17 : tensor<1xi1>, tensor<1xi64>, tensor<1xi64>) outs(%11 : tensor<1xi64>)
                 ^
failed-fusion.mlir:25:18: note: see current operation: 
%5:2 = "linalg.generic"(%2, %3) <{indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = [#linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 0, 2>}> ({
^bb0(%arg0: i64, %arg1: i64):
  %6 = "arith.cmpi"(%arg0, %1) <{predicate = 2 : i64}> : (i64, i64) -> i1
  %7 = "arith.select"(%6, %0, %arg1) : (i1, i64, i64) -> i64
  "linalg.yield"(%6, %7) : (i1, i64) -> ()
}) : (tensor<1xi1>, tensor<1xi64>) -> (tensor<1xi1>, tensor<1xi64>)

generic version

#map = affine_map<(d0)->(d0)>
func.func @main() -> (tensor<1xi64>, tensor<1xi64>) {
    %cst_16 = arith.constant dense<0> : tensor<1xi64>
    %cst_17 = arith.constant dense<1> : tensor<1xi64>
    %cst_3 = arith.constant dense<26> : tensor<1xi64>
    %c0_i64 = arith.constant 0 : i64
    %c1_i64 = arith.constant 1 : i64
    %9 = tensor.empty() : tensor<1xi1>
    %mapped_25 = linalg.generic {
      indexing_maps = [#map, #map],
      iterator_types = ["parallel"]}
      ins(%cst_17 : tensor<1xi64>) outs(%9 : tensor<1xi1>) {
    ^bb0(%b0: i64, %b1: i1):
        %255 = arith.cmpi slt, %b0, %c0_i64 : i64
        linalg.yield %255 : i1
    } -> tensor<1xi1>
    %11 = tensor.empty() : tensor<1xi64>
    %mapped_27 = linalg.generic {
      indexing_maps = [#map, #map],
      iterator_types = ["parallel"]}
      ins(%cst_17 : tensor<1xi64>) outs(%11 : tensor<1xi64>) {
    ^bb0(%b0: i64, %b1: i64):
        %255 = arith.subi %c0_i64, %b0 : i64
        linalg.yield %255 : i64
    } -> tensor<1xi64>
    %mapped_28 = linalg.generic {
      indexing_maps = [#map, #map],
      iterator_types = ["parallel"]}
      ins(%cst_3 : tensor<1xi64>) outs(%11 : tensor<1xi64>) {
    ^bb0(%b0: i64, %b1: i64):
        %255 = arith.addi %b0, %c1_i64 : i64
        linalg.yield %255 : i64
    } -> tensor<1xi64>
    %mapped_29 = linalg.generic {
      indexing_maps = [#map, #map, #map, #map],
      iterator_types = ["parallel"]}
      ins(%mapped_25, %mapped_28, %cst_16 : tensor<1xi1>, tensor<1xi64>, tensor<1xi64>) outs(%11 : tensor<1xi64>) {
    ^bb0(%b0: i1, %b1: i64, %b2: i64, %b3: i64):
        %255 = arith.select %b0, %b1, %b2 : i64
        linalg.yield %255 : i64
    } -> tensor<1xi64>
    %mapped_30 = linalg.generic {
      indexing_maps = [#map, #map, #map, #map],
      iterator_types = ["parallel"]}
      ins(%mapped_25, %mapped_27, %cst_17 : tensor<1xi1>, tensor<1xi64>, tensor<1xi64>) outs(%11 : tensor<1xi64>) {
    ^bb0(%b0: i1, %b1: i64, %b2: i64, %b3: i64):
        %255 = arith.select %b0, %b1, %b2 : i64
        linalg.yield %255 : i64
    } -> tensor<1xi64>
    return %mapped_29, %mapped_30 : tensor<1xi64>, tensor<1xi64>
}

successfully fuses and generates

  func.func @main() -> (tensor<1xi64>, tensor<1xi64>) {
    %c1_i64 = arith.constant 1 : i64
    %c0_i64 = arith.constant 0 : i64
    %0 = tensor.empty() : tensor<1xi64>
    %1 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} outs(%0 : tensor<1xi64>) {
    ^bb0(%out: i64):
      linalg.yield %c0_i64 : i64
    } -> tensor<1xi64>
    %2 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} outs(%0 : tensor<1xi64>) {
    ^bb0(%out: i64):
      linalg.yield %c1_i64 : i64
    } -> tensor<1xi64>
    return %1, %2 : tensor<1xi64>, tensor<1xi64>
  }

since linalg-generalize-named-ops doesn't convert linalg.map i had to translate by hand, so hopefully didn't mess it up.

It's strange that this case doesn't constant fold in the first place. The first 3 generics can definitely be constant folded to dense<false>, dense<1> and dense<27>, respectively. Which means the last two ops would equate to dense<0> and dense<1>, which is what the above fused result for the generics suggests (but again, can be constant folded)

@srcarroll
Copy link
Contributor Author

srcarroll commented Jun 19, 2025

I think I know why this fails. The fundamental difference between linalg.map and linalg.generic is that the former doesn't have block arguments for the init operands. So any logic involving changing/mapping output block args won't be guaranteed to work generically in the current state. I'll try to figure something out

// CHECK-NEXT: %[[EXP0:.*]] = math.exp %[[IN1]]
// CHECK-NEXT: %[[SQRT0:.*]] = math.sqrt %[[IN0]]
// CHECK-NEXT: %[[EXP1:.*]] = math.exp %[[IN1]]
// CHECK-NEXT: %[[SQRT1:.*]] = math.sqrt %[[IN0]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the duplicate computations are an old artifact. these do go away with cse but let me know if this is something that should be looked at

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed odd. Looks like a bug in the fuser. Could be related to the map vs generic issue you've seen above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did make a generic version of this and ran the old version of the pass and got same results to confirm it's a pre-existing thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the generic version

#map = affine_map<(d0)->(d0)>
func.func @map_ops_mixed_types(%arg0: tensor<8xf32>, %arg1: tensor<8xf32>, %arg3: tensor<8xf32>) -> tensor<8xf32> {
  %init = tensor.empty() : tensor<8xi1>
  %initf = tensor.empty() : tensor<8xf32>
  %0 = linalg.generic {
      indexing_maps = [#map, #map],
      iterator_types = ["parallel"]} ins(%arg0 : tensor<8xf32>) outs(%initf : tensor<8xf32>) {
    ^bb0(%in0 : f32, %out : f32):
        %sqrt = math.sqrt %in0 : f32
        linalg.yield %sqrt : f32
    } -> tensor<8xf32>
  %1 = linalg.generic {
      indexing_maps = [#map, #map],
      iterator_types = ["parallel"]} ins(%arg1 : tensor<8xf32>) outs(%initf : tensor<8xf32>) {
    ^bb0(%in0 : f32, %out : f32):
        %sqrt = math.exp %in0 : f32
        linalg.yield %sqrt : f32
    } -> tensor<8xf32>
  %2 = linalg.generic {
      indexing_maps = [#map, #map, #map],
      iterator_types = ["parallel"]} ins(%0, %1 : tensor<8xf32>, tensor<8xf32>) outs(%init : tensor<8xi1>) {
    ^bb0(%in0 : f32, %in1 : f32, %out: i1):
      %cmp = arith.cmpf olt, %in0, %in1 : f32
      linalg.yield %cmp : i1
  } -> tensor<8xi1>
  %3 = linalg.generic {
      indexing_maps = [#map, #map, #map, #map],
      iterator_types = ["parallel"]} ins(%2, %0, %1 : tensor<8xi1>, tensor<8xf32>, tensor<8xf32>) outs(%initf : tensor<8xf32>) { 
    ^bb0(%in0 : i1, %in1 : f32, %in2 : f32, %out: f32):
      %select = arith.select %in0, %in1, %in2 : f32
      linalg.yield %select : f32
  } -> tensor<8xf32>
  return %3 : tensor<8xf32>
}

@llvmbot
Copy link
Member

llvmbot commented Jun 19, 2025

@llvm/pr-subscribers-mlir-linalg

@llvm/pr-subscribers-mlir

Author: None (srcarroll)

Changes

This patch modifies FuseElementwiseOps and related code to work on any LinalgOp rather than only GenericOps. For example, this enables fusion for MapOps, which are always elementwise. The fundamental logic is unchanged and, for the most part, the changes are simply changing types from GenericOp to LinalgOp.


Full diff: https://github.com/llvm/llvm-project/pull/144922.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h (+2-2)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp (+42-21)
  • (modified) mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir (+21)
  • (modified) mlir/test/Dialect/Linalg/fusion-elementwise.mlir (+54)
diff --git a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
index 147a2907f52e4..f0c8f0de06637 100644
--- a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
@@ -529,8 +529,8 @@ fuseElementwiseOps(RewriterBase &rewriter, OpOperand *fusedOperand);
 /// * There is a chance that the implementation of the transformation does not
 /// agree with the result of this method. This function gives a prediction based
 /// on an optimized fusion.
-llvm::SmallDenseSet<int> getPreservedProducerResults(GenericOp producer,
-                                                     GenericOp consumer,
+llvm::SmallDenseSet<int> getPreservedProducerResults(LinalgOp producer,
+                                                     LinalgOp consumer,
                                                      OpOperand *fusedOperand);
 
 /// Try to peel and canonicalize loop `op` and return the new result.
diff --git a/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp b/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
index f97ed3d6d5111..fc435b47f5977 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
@@ -77,11 +77,11 @@ static AffineMap getIndexingMapOfProducerOperandsInCoordinatesOfFusedOp(
 // of the fused producer & consumer after the fusion can still compute the
 // bounds of the op.
 static bool isOpOperandCanBeDroppedAfterFusedLinalgs(
-    GenericOp producer, GenericOp consumer,
+    LinalgOp producer, LinalgOp consumer,
     ArrayRef<OpOperand *> opOperandsToIgnore) {
   SmallVector<AffineMap> indexingMaps;
 
-  SmallVector<GenericOp> ops = {producer, consumer};
+  SmallVector<LinalgOp> ops = {producer, consumer};
   for (auto &op : ops) {
     for (auto &opOperand : op->getOpOperands()) {
       if (llvm::is_contained(opOperandsToIgnore, &opOperand)) {
@@ -109,8 +109,9 @@ static bool isOpOperandCanBeDroppedAfterFusedLinalgs(
 /// * There is a chance that the implementation of the transformation does not
 /// agree with the result of this method. This function gives a prediction based
 /// on an optimized fusion.
-llvm::SmallDenseSet<int> mlir::linalg::getPreservedProducerResults(
-    GenericOp producer, GenericOp consumer, OpOperand *fusedOperand) {
+llvm::SmallDenseSet<int>
+mlir::linalg::getPreservedProducerResults(LinalgOp producer, LinalgOp consumer,
+                                          OpOperand *fusedOperand) {
   llvm::SmallDenseSet<int> preservedProducerResults;
   llvm::SmallVector<OpOperand *> opOperandsToIgnore;
 
@@ -140,8 +141,8 @@ bool mlir::linalg::areElementwiseOpsFusable(OpOperand *fusedOperand) {
   if (!fusedOperand)
     return false;
 
-  auto producer = fusedOperand->get().getDefiningOp<GenericOp>();
-  auto consumer = dyn_cast<GenericOp>(fusedOperand->getOwner());
+  auto producer = fusedOperand->get().getDefiningOp<LinalgOp>();
+  auto consumer = dyn_cast<LinalgOp>(fusedOperand->getOwner());
 
   // Check producer and consumer are generic ops.
   if (!producer || !consumer)
@@ -215,16 +216,39 @@ bool mlir::linalg::areElementwiseOpsFusable(OpOperand *fusedOperand) {
 /// Generate the region of the fused tensor operation. The region of the fused
 /// op must be empty.
 static void generateFusedElementwiseOpRegion(
-    RewriterBase &rewriter, GenericOp fusedOp,
+    RewriterBase &rewriter, LinalgOp fusedOp,
     AffineMap consumerToProducerLoopsMap, OpOperand *fusedOperand,
     unsigned nloops, llvm::SmallDenseSet<int> &preservedProducerResults) {
-  auto producer = cast<GenericOp>(fusedOperand->get().getDefiningOp());
-  auto consumer = cast<GenericOp>(fusedOperand->getOwner());
+  auto producer = cast<LinalgOp>(fusedOperand->get().getDefiningOp());
+  auto consumer = cast<LinalgOp>(fusedOperand->getOwner());
   // Build the region of the fused op.
+
+  // Since some ops, like `linalg.map`, do not have block arguments for init operands
+  // then we first "generalize" the block by adding arguments for init operands when
+  // they aren't present. We detect this case by checking if
+  // `getOpOperandsMatchingBBargs() == getDpsInputOperands(); 
   Block &producerBlock = producer->getRegion(0).front();
+  if (producer.getOpOperandsMatchingBBargs() ==
+      producer.getDpsInputOperands()) {
+    for (auto init : producer.getDpsInits()) {
+      Type bbType = isa<ShapedType>(init.getType())
+                        ? cast<ShapedType>(init.getType()).getElementType()
+                        : init.getType();
+      producerBlock.addArgument(bbType, producer.getLoc());
+    }
+  }
   Block &consumerBlock = consumer->getRegion(0).front();
+  if (consumer.getOpOperandsMatchingBBargs() ==
+      consumer.getDpsInputOperands()) {
+    for (auto init : consumer.getDpsInits()) {
+      Type bbType = isa<ShapedType>(init.getType())
+                        ? cast<ShapedType>(init.getType()).getElementType()
+                        : init.getType();
+      consumerBlock.addArgument(bbType, consumer.getLoc());
+    }
+  }
   OpBuilder::InsertionGuard guard(rewriter);
-  Block *fusedBlock = rewriter.createBlock(&fusedOp.getRegion());
+  Block *fusedBlock = rewriter.createBlock(&fusedOp->getRegion(0));
   IRMapping mapper;
 
   // 2. Add an index operation for every fused loop dimension and use the
@@ -330,7 +354,7 @@ static void generateFusedElementwiseOpRegion(
   rewriter.create<YieldOp>(fusedOp.getLoc(), fusedYieldValues);
 
   // Sanity checks.
-  assert(fusedBlock->getNumArguments() == fusedOp.getNumOperands() &&
+  assert(fusedBlock->getNumArguments() == fusedOp->getNumOperands() &&
          "Ill-formed GenericOp region");
 }
 
@@ -340,8 +364,8 @@ mlir::linalg::fuseElementwiseOps(RewriterBase &rewriter,
   assert(areElementwiseOpsFusable(fusedOperand) &&
          "expected elementwise operation pre-conditions to pass");
   auto producerResult = cast<OpResult>(fusedOperand->get());
-  auto producer = cast<GenericOp>(producerResult.getOwner());
-  auto consumer = cast<GenericOp>(fusedOperand->getOwner());
+  auto producer = cast<LinalgOp>(producerResult.getOwner());
+  auto consumer = cast<LinalgOp>(fusedOperand->getOwner());
   // TODO: allow fusing the producer of an output operand.
   assert(consumer.isDpsInput(fusedOperand) &&
          "expected producer of input operand");
@@ -418,10 +442,7 @@ mlir::linalg::fuseElementwiseOps(RewriterBase &rewriter,
   // Generate the fused op.
   auto fusedOp = rewriter.create<GenericOp>(
       consumer.getLoc(), fusedResultTypes, fusedInputOperands,
-      fusedOutputOperands, rewriter.getAffineMapArrayAttr(fusedIndexMaps),
-      consumer.getIteratorTypes(),
-      /*doc=*/nullptr,
-      /*library_call=*/nullptr);
+      fusedOutputOperands, fusedIndexMaps, consumer.getIteratorTypesArray());
   if (!fusedOp.getShapesToLoopsMap()) {
     // Fused op has invalid indexing maps. Typically this means something is off
     // in the input, but going ahead here would result in verification errors.
@@ -460,14 +481,14 @@ mlir::linalg::fuseElementwiseOps(RewriterBase &rewriter,
 
 namespace {
 /// Patterns to fuse a generic op, with the producer of its operands.
-class FuseElementwiseOps : public OpRewritePattern<GenericOp> {
+class FuseElementwiseOps : public OpInterfaceRewritePattern<LinalgOp> {
 public:
   FuseElementwiseOps(MLIRContext *context, ControlFusionFn fun,
                      PatternBenefit benefit = 1)
-      : OpRewritePattern<GenericOp>(context, benefit),
+      : OpInterfaceRewritePattern<LinalgOp>(context, benefit),
         controlFn(std::move(fun)) {}
 
-  LogicalResult matchAndRewrite(GenericOp genericOp,
+  LogicalResult matchAndRewrite(LinalgOp genericOp,
                                 PatternRewriter &rewriter) const override {
     // Find the first operand that is defined by another generic op on tensors.
     for (OpOperand &opOperand : genericOp->getOpOperands()) {
@@ -494,7 +515,7 @@ class FuseElementwiseOps : public OpRewritePattern<GenericOp> {
       rewriter.eraseOp(genericOp);
       return success();
     }
-    return failure();
+    return rewriter.notifyMatchFailure(genericOp, "no fusable operands");
   }
 
 private:
diff --git a/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir b/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
index 66fc55fadf8fa..b581567cf57a7 100644
--- a/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
+++ b/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
@@ -1014,3 +1014,24 @@ module {
 //   CHECK-DAG:     %[[T3:.+]] = arith.addf %[[T2]], %[[B1]]
 //       CHECK:     linalg.yield %[[T3]] : f32
 //       CHECK:   return %[[GENERIC]]
+
+// -----
+
+func.func @map_ops(%in1: tensor<8xf32>, %in2: tensor<8xf32>) -> tensor<8xf32> {
+    %fill = tensor.empty() : tensor<8xf32>
+    %add = linalg.map {arith.addf} ins(%in1, %in2: tensor<8xf32>, tensor<8xf32>) outs(%fill: tensor<8xf32>)
+    %mapped_65 = linalg.map { math.sqrt } ins(%add : tensor<8xf32>) outs(%fill : tensor<8xf32>)
+    return %mapped_65 : tensor<8xf32>
+}
+
+// CHECK-LABEL: func @map_ops
+//  CHECK-SAME:   %[[ARG0:[a-zA-Z0-9]+]]: tensor<8xf32>
+//  CHECK-SAME:   %[[ARG1:[a-zA-Z0-9]+]]: tensor<8xf32>
+//       CHECK:   %[[EMPTY:.+]] = tensor.empty() : tensor<8xf32>
+//       CHECK:   %[[FUSED_OP:.+]] = linalg.generic
+//  CHECK-SAME:       ins(%[[ARG0]], %[[ARG1]] : {{.*}}) outs(%[[EMPTY]] :
+//  CHECK-NEXT:   ^bb0(%[[IN0:.*]]: f32, %[[IN1:.*]]: f32, %[[OUT:.*]]: f32):
+//  CHECK-NEXT:     %[[ADD:.*]] = arith.addf %[[IN0]], %[[IN1]]
+//  CHECK-NEXT:     %[[SQRT:.*]] = math.sqrt %[[ADD]]
+//  CHECK-NEXT:     linalg.yield %[[SQRT]] 
+//   CHECK-NOT:   linalg.generic
diff --git a/mlir/test/Dialect/Linalg/fusion-elementwise.mlir b/mlir/test/Dialect/Linalg/fusion-elementwise.mlir
index bd9977f1410b9..18ca8b42fa79c 100644
--- a/mlir/test/Dialect/Linalg/fusion-elementwise.mlir
+++ b/mlir/test/Dialect/Linalg/fusion-elementwise.mlir
@@ -59,3 +59,57 @@ func.func @handle_unused_operands(%arg0: tensor<8xf32>, %arg1: tensor<8xf32>) ->
 //       CHECK:   %[[FUSED_OP:.+]] = linalg.generic
 //  CHECK-SAME:       outs(%[[EMPTY]] :
 //   CHECK-NOT:   linalg.generic
+
+// -----
+
+func.func @map_ops(%in1: tensor<8xf32>, %in2: tensor<8xf32>) -> tensor<8xf32> {
+    %fill = tensor.empty() : tensor<8xf32>
+    %add = linalg.map {arith.addf} ins(%in1, %in2: tensor<8xf32>, tensor<8xf32>) outs(%fill: tensor<8xf32>)
+    %mapped_65 = linalg.map { math.sqrt } ins(%add : tensor<8xf32>) outs(%fill : tensor<8xf32>)
+    return %mapped_65 : tensor<8xf32>
+}
+
+// CHECK-LABEL: func @map_ops
+//  CHECK-SAME:   %[[ARG0:[a-zA-Z0-9]+]]: tensor<8xf32>
+//  CHECK-SAME:   %[[ARG1:[a-zA-Z0-9]+]]: tensor<8xf32>
+//       CHECK:   %[[EMPTY:.+]] = tensor.empty() : tensor<8xf32>
+//       CHECK:   %[[FUSED_OP:.+]] = linalg.generic
+//  CHECK-SAME:       ins(%[[ARG0]], %[[ARG1]] : {{.*}}) outs(%[[EMPTY]] :
+//  CHECK-NEXT:   ^bb0(%[[IN0:.*]]: f32, %[[IN1:.*]]: f32, %[[OUT:.*]]: f32):
+//  CHECK-NEXT:     %[[ADD:.*]] = arith.addf %[[IN0]], %[[IN1]]
+//  CHECK-NEXT:     %[[SQRT:.*]] = math.sqrt %[[ADD]]
+//  CHECK-NEXT:     linalg.yield %[[SQRT]] 
+//   CHECK-NOT:   linalg.map
+
+// -----
+
+func.func @map_ops_mixed_types(%arg0: tensor<8xf32>, %arg1: tensor<8xf32>) -> tensor<8xf32> {
+  %init = tensor.empty() : tensor<8xi1>
+  %initf = tensor.empty() : tensor<8xf32>
+  %0 = linalg.map {math.sqrt} ins(%arg0 : tensor<8xf32>) outs(%initf : tensor<8xf32>)
+  %1 = linalg.map {math.exp} ins(%arg1 : tensor<8xf32>) outs(%initf : tensor<8xf32>)
+  %2 = linalg.map ins(%0, %1 : tensor<8xf32>, tensor<8xf32>) outs (%init : tensor<8xi1>)
+    (%in0 : f32, %in1 : f32) {
+      %cmp = arith.cmpf olt, %in0, %in1 : f32
+      linalg.yield %cmp : i1
+  }
+  %3 = linalg.map { arith.select } ins(%2, %0, %1 : tensor<8xi1>, tensor<8xf32>, tensor<8xf32>) outs(%initf : tensor<8xf32>) 
+  return %3 : tensor<8xf32>
+}
+
+// CHECK-LABEL: func @map_ops_mixed_types
+//  CHECK-SAME:   %[[ARG0:[a-zA-Z0-9]+]]: tensor<8xf32>
+//  CHECK-SAME:   %[[ARG1:[a-zA-Z0-9]+]]: tensor<8xf32>
+//       CHECK:   %[[EMPTY:.+]] = tensor.empty() : tensor<8xf32>
+//       CHECK:   %[[FUSED_OP:.+]] = linalg.generic
+//  CHECK-SAME:       ins(%[[ARG0]], %[[ARG1]] : {{.*}}) outs(%[[EMPTY]] :
+//  CHECK-NEXT:   ^bb0(%[[IN0:.*]]: f32, %[[IN1:.*]]: f32, %[[OUT:.*]]: f32):
+//  CHECK-NEXT:     %[[EXP0:.*]] = math.exp %[[IN1]]
+//  CHECK-NEXT:     %[[SQRT0:.*]] = math.sqrt %[[IN0]]
+//  CHECK-NEXT:     %[[EXP1:.*]] = math.exp %[[IN1]]
+//  CHECK-NEXT:     %[[SQRT1:.*]] = math.sqrt %[[IN0]]
+//  CHECK-NEXT:     %[[CMP:.*]] = arith.cmpf olt, %[[SQRT1]], %[[EXP1]]
+//  CHECK-NEXT:     %[[RES:.*]] = arith.select %[[CMP]], %[[SQRT0]], %[[EXP0]]
+//  CHECK-NEXT:     linalg.yield %[[RES]] 
+//   CHECK-NOT:   linalg.map
+

@srcarroll
Copy link
Contributor Author

The current changes in this PR will fuse linalg.map ops, but produce a linalg.generic even when it is technically possible to generate another linalg.map. I dont know enough to modify the code in a sane way to get linalg.map outputs (when appropriate) so if anyone has any ideas, i would love to hear. I dont think it's important enough to go crazy with, but would be nice.

After a chat with @rengolin here , in particular

[quote="rengolin, post:6, topic:83927"]
In this case, a fusion of two elementwise operations would yield a map
[/quote]

it's probably important enough to warrant discussion

@srcarroll
Copy link
Contributor Author

srcarroll commented Jun 20, 2025

It occurs to me now that options like fuse-generic-ops no longer make sense. Although it didn't make sense to me in the first place since generic ops were the only ops that were fused. I think we can get rid of this option altogether in favor of fuse-generic-ops-control. However, I think the functionality of this should change. Since this option always just uses the setFusedOpOperandLimit<4> control function, how about we just repurpose this option to set the limit number? e.g. fuse-operand-limit=4.

Any thoughts?

Edit: Actually, I was just confused by the naming, but I see now how fuse-generic-ops is intended to be used in the context of the rest of the options. I do still think it isn't necessary if absorbed into fuse-generic-ops-control. So, for example, fuse-operand-limit=0 and fuse-operand-limit=max could cover the same functionality as fuse-generic-ops

@srcarroll srcarroll changed the title [mlir][linalg] Elementwise fusion for any LinalgOp [mlir][linalg] Extend FuseElementwiseOps pattern to work with named ops Jun 20, 2025
@srcarroll
Copy link
Contributor Author

I'm now realizing I should extend all the patterns in populateElementwiseOpsFusionPatterns. It should be similarly straight forward. Since the scope is larger than I initially thought, I'll wait for feedback before making any major changes.

@rengolin rengolin requested a review from adam-smnk June 20, 2025 09:31
@joker-eph
Copy link
Collaborator

joker-eph commented Nov 5, 2025

@srcarroll your take seems reasonable to me. This is aligned with the usual value of working upstream: we bias towards future users rather than incumbent. It is also a matter of encouraging upstream evolution and maintenance: having to jump through too many hoops because of just "fears" of causing downstream to update some APIs is discouraging upstream contributors, in particular those who aren't necessarily paid to work upstream. Reducing the overhead/friction to improve the codebase is critical, and the willingness to break APIs (with reasons) is part of it.

Edit: I'm not judging whether your patch here in particular is good or not, just the angle at which you're looking at it and the kind of rationale you've been trying to apply.

Block &consumerBlock = consumer->getRegion(0).front();
OpBuilder::InsertionGuard guard(rewriter);
Block *fusedBlock = rewriter.createBlock(&fusedOp.getRegion());
Block *fusedBlock = rewriter.createBlock(&fusedOp->getRegion(0));
Copy link
Member

@Groverkss Groverkss Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. Please use interface methods to do it correctly:

/*methodName=*/"getRegionBuilder",

or

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRegionBuilder is wrong since that just gets a function ref, but will use getBlock. thanks for pointing it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm actually unsure how to apply this suggestion. the other one above makes sense since i'm just getting the blocks that have already been created. however this one is creating a block in the op's region and I dont see an interface method for getting a region. If region 0 isn't guaranteed, shouldn't the interface have a method for that?

Anyway, fusedOp doesn't have to be a LinalgOp for these initial changes since the function that calls this one explicitly creates a GenericOp (see https://github.com/llvm/llvm-project/pull/144922/files#diff-a7543973103a3f3abb605911ca6d141dc4ffd4782b2bc0ad57890d11ab72e2c1R422). So it's probably better to just declare fusedOp as GenericOp for this function and revert this line. Any thoughts?

@rengolin had a discussion back when this PR first went up about generating named ops post fusion when possible. I think it was agreed that it makes sense to leave this as a TODO (see #144922 (comment)). So when we get there we can revisit how to do this, unless there's an obvious solution now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, fusedOp doesn't have to be a LinalgOp for these initial changes since the function that calls this one explicitly creates a GenericOp (see https://github.com/llvm/llvm-project/pull/144922/files#diff-a7543973103a3f3abb605911ca6d141dc4ffd4782b2bc0ad57890d11ab72e2c1R422). So it's probably better to just declare fusedOp as GenericOp for this function and revert this line. Any thoughts?

Probably better to declare it as a GenericOp yes, since the transformation always (today) returns a GenericOp anyway.

@rengolin had a discussion back when this PR first went up about generating named ops post fusion when possible. I think it was agreed that it makes sense to leave this as a TODO (see #144922 (comment)). So when we get there we can revisit how to do this, unless there's an obvious solution now.

I think that is a different problem. My main concern is that expecting something from an interface when the interface doesn't gurantee it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a separate problem. i just meant that when we make the changes to fuse to named ops (when possible) then LinalgOp might be re-introduced here and thus run into this issue again. But it is possible that more refactoring would need to happen anyway so wouldn't necessarily run back into this issue again here specifically.

Comment on lines 227 to 228
Block &producerBlock = producer->getRegion(0).front();
Block &consumerBlock = consumer->getRegion(0).front();
Copy link
Member

@Groverkss Groverkss Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot do this on an interface. There is no gurantee that the op implementing this interface has region 0 as the region of the op. You can use the interface methods to get this:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes. thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Groverkss Groverkss dismissed their stale review November 5, 2025 04:58

i just had techincal comments that were resolved

Copy link
Contributor

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have serious concerns on the change as is

@MaheshRavishankar
Copy link
Contributor

@srcarroll your take seems reasonable to me. This is aligned with the usual value of working upstream: we bias towards future users rather than incumbent. It is also a matter of encouraging upstream evolution and maintenance: having to jump through too many hoops because of just "fears" of causing downstream to update some APIs is discouraging upstream contributors, in particular those who aren't necessarily paid to work upstream. Reducing the overhead/friction to improve the codebase is critical, and the willingness to break APIs (with reasons) is part of it.

Please read through the entire thread. It's not just about breakage. This is about expected functionality. This is doing two things at once and that is not what this is intended for. This is a new functionality that is combining two transformation "generalization" and "fusion". I think they should be kept separate

Edit: I'm not judging whether your patch here in particular is good or not, just the angle at which you're looking at it and the kind of rationale you've been trying to apply.

@ftynse
Copy link
Member

ftynse commented Nov 5, 2025

I suspect we will need @llvm/mlir-area-team on this.

My personal opinion is that the change may go through as is, though we may need some notice-related process around it. LLVM de facto does not have any guarantee on C++ API stability. I don't think this was ever codified in the policy, but was discussed repeatedly, e.g. here: https://discourse.llvm.org/t/explicitly-spelling-out-the-lack-of-stability-for-the-c-api-in-the-developer-policy/55952. I will go as far as claim that this is an incentive to put code upstream to shift the update burden. I am also rather reluctant to bloat upstream API with similar-but-different calls.

That being said, since all contributors work on MLIR because they use it downstream, I think it is a common courtesy to minimize extra workload for everyone by staging the change when reasonable. The upstream committer may have to extend extra effort, but they will ultimately save time if everyone else does the same. Usually we do it through a PSA on Discourse that points to the commit, explains how to adapt to it, and provides some time frame after which the change will be effective. This may mean that PR is approved without landing immediately, or the main implementation lands but a switch for the default behavior is change after some period dependent on the complexity of the change. Maybe we need to adapt this policy https://llvm.org/docs/DeveloperPolicy.html#making-potentially-breaking-changes and establish a vendors team for MLIR to get relevant folks notified.

An alternative suggestion is to (1) land this with the default control function disabling fusion of named ops, (2) post a PSA on the forum that the default will change to fusing named ops in ~2 weeks given the change is rather minor, (3) resolve any comments and land the switch for the default. If there are other downstreams than the one @MaheshRavishankar maintains that have concerns, they should chime in. If they don't put in the effort of doing so, why should we put the effort to adapt to them?

@ftynse
Copy link
Member

ftynse commented Nov 5, 2025

On the change itself, the rationale https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/#progressive-lowering-dont-lose-information-too-quicklya-nameprogressive_loweringa explicitly lists fusion as a transformation that drops high-level information, which can be interpreted as a desire to give users control over generalization behavior during fusion. This would be aligned with @MaheshRavishankar's suggestion, though not because it breaks downstreams but because it would better reflect the design rationale of linalg. The control function does so, but maybe we can have several pre-defined for an easy switch?

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 5, 2025

This is a new functionality that is combining two transformation "generalization" and "fusion"

@MaheshRavishankar Not really. It's still only fusion. In the current state, the result of fusion is always a linalg.generic. A lot of times it has to be because the fused result doesn't have a named operation, e.g. fusing linalg.elementwise and linalg.matmul has to be a linalg.generic. In other cases, the fused result can be a named op, e.g. linalg.map + linalg.map -> linalg.map (note that fusing two linalg.elementwises wouldn't work this way because this op only allows one operation in the payload, so it too would have to fuse to linalg.generic, or could fuse to linalg.map only if the indexing maps are identity). I would argue that the former case happens more often the the latter case. So getting linalg.generic, for the most part, is just part of fusion, not that we are explicitly generalizing named ops in order to do fusion. In fact, before these changes a user would HAVE to generalize named ops before fusion in order to opt in to fusing ops that aren't already linalg.generic. And again, these changes still provide the option to opt out of fusion.

@renato and I have already discussed the desire to keep named ops as long as possible. I am in firm agreement with this. See here for discussion. We also agreed that this can be done incrementally (see here), but others welcome to join this conversation and express their thoughts on it.

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 5, 2025

I suspect we will need @llvm/mlir-area-team on this.

My personal opinion is that the change may go through as is, though we may need some notice-related process around it. LLVM de facto does not have any guarantee on C++ API stability. I don't think this was ever codified in the policy, but was discussed repeatedly, e.g. here: https://discourse.llvm.org/t/explicitly-spelling-out-the-lack-of-stability-for-the-c-api-in-the-developer-policy/55952. I will go as far as claim that this is an incentive to put code upstream to shift the update burden. I am also rather reluctant to bloat upstream API with similar-but-different calls.

That being said, since all contributors work on MLIR because they use it downstream, I think it is a common courtesy to minimize extra workload for everyone by staging the change when reasonable. The upstream committer may have to extend extra effort, but they will ultimately save time if everyone else does the same. Usually we do it through a PSA on Discourse that points to the commit, explains how to adapt to it, and provides some time frame after which the change will be effective. This may mean that PR is approved without landing immediately, or the main implementation lands but a switch for the default behavior is change after some period dependent on the complexity of the change. Maybe we need to adapt this policy https://llvm.org/docs/DeveloperPolicy.html#making-potentially-breaking-changes and establish a vendors team for MLIR to get relevant folks notified.

An alternative suggestion is to (1) land this with the default control function disabling fusion of named ops, (2) post a PSA on the forum that the default will change to fusing named ops in ~2 weeks given the change is rather minor, (3) resolve any comments and land the switch for the default. If there are other downstreams than the one @MaheshRavishankar maintains that have concerns, they should chime in. If they don't put in the effort of doing so, why should we put the effort to adapt to them?

@ftynse Thanks for you input. I'm certainly amenable to making it a smooth transition

On the change itself, the rationale https://mlir.llvm.org/docs/Rationale/RationaleLinalgDialect/#progressive-lowering-dont-lose-information-too-quicklya-nameprogressive_loweringa explicitly lists fusion as a transformation that drops high-level information, which can be interpreted as a desire to give users control over generalization behavior during fusion. This would be aligned with @MaheshRavishankar's suggestion, though not because it breaks downstreams but because it would better reflect the design rationale of linalg. The control function does so, but maybe we can have several pre-defined for an easy switch?

I'm still not quite seeing the benefit of the suggestion over just customizing control logic, but also not dismissing it yet. I still need to look into it to understand what it does for us.

I'm fine with adding some pre-defined control functions. Obviously one of them would be for only fusing generics so people can keep previous behavior if they want. Are there any other noteworthy ones? Happy to hear some suggestions on this.

Edit: I did make the suggestion here to keep the linalg-fuse-elementwise-ops pass the same during the transition. But @MaheshRavishankar noted this might not be enough, which I think is valid.

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 5, 2025

@ftynse @MaheshRavishankar @rengolin
I was curious what kinds of changes downstream users would need to make in their stack. Really the only main one I'm familiar with that actually uses this pattern is IREE. I think this is a good representative example, but would be glad to hear from others of possible more complex use cases. In IREE, the pattern is used in 4 places; 3 of them use fuseElementwiseOps directly and one uses populateElementwiseOpsFusionPatterns.

  1. https://github.com/iree-org/iree/blob/dca3747642d6d06280c5629106ff0bd2fe0bc4e1/compiler/src/iree/compiler/DispatchCreation/ElementwiseOpFusion.cpp#L176 . Just update fuseElementwiseOpsControlFn to only accept GenericOp if you want to keep current behavior. This is trivial
  2. https://github.com/iree-org/iree/blob/dca3747642d6d06280c5629106ff0bd2fe0bc4e1/compiler/src/iree/compiler/Codegen/Common/RematerializeParallelOps.cpp#L65 . No change needed because the containing pattern is only on GenericOp anyway.
  3. https://github.com/iree-org/iree/blob/dca3747642d6d06280c5629106ff0bd2fe0bc4e1/compiler/src/iree/compiler/Codegen/Common/DecomposeSoftmax.cpp#L56 . Again no change needed for same reason as (2)
  4. https://github.com/iree-org/iree/blob/dca3747642d6d06280c5629106ff0bd2fe0bc4e1/compiler/src/iree/compiler/DispatchCreation/FuseMultiUseElementwiseProducer.cpp#L122 . This is less trivial than the others, so could miss something, but I think it's also a no change situation. The call to fuseElementwiseOps is contained in the doMultiUseFusion function. Although this function isn't specific to GenericOp, it is a one-off static function that is only used here. As far as I can tell, everything in fusedOps is a GenericOp.

So unless I'm wrong in my assessment here, doing the change proposed by @MaheshRavishankar would actual require more changes in IREE, albeit still trivial. I'm not saying this is reason alone to not do the change, but I think it is valuable data that provides more context to the discussion. Note: I have 0 investment in the IREE project, so I'm not cherry-picking, it's just the only relevant downstream example I can think of.

Edit:

doing the change proposed by @MaheshRavishankar would actual require more changes in IREE, albeit still trivial.

no, i forgot the proposal was actually to have two functions such that fuseElementwiseOps remains effectively unchanged for users. So forget this part. But I do think the above is good data for the discussion. In the IREE case, the proposal would make no difference. Still open to hear about other possible cases.

@MaheshRavishankar
Copy link
Contributor

MaheshRavishankar commented Nov 6, 2025

Thanks @ftynse for weighing in. Let me see if @srcarroll and I can resolve the discussion before escalating it to the governance group.

@srcarroll , first up, id suggest you leave IREE out of the discussion. It usually ends up muddling the conversation. I am not going to suggest anything related to what IREE does here (which is something completely different and that we found works best for our stack).

The crux of the difference here is that the current fuseElementwiseOps expects to work on linalg.generic. This operation is effectively the most direct implementation of the LinalgStructuredOpInterface. What you are suggesting is to adapt it to work for any linalg named ops. The change itself is straight-forward cause anything that works for a linalg.generic obviously works for any operation that implements the LinalgStructuredOpInterface.

The concern I have is that doing this change is effectively combining two fairly substantial transformations, (1) generalization of named ops, and (2) elementwise op fusion into one. This will happen whether or not anyone intended for this to happen. So a pass that uses fuseElementwiseOps and intended to take a program that has a mix of linalg.generic operations and Linalg named ops and just fuse the linalg.generic operations (under some cost function), now will also have to account for the named ops being generalized and fused with other linalg.generic/linalg named ops. That is a significant change of functionality.

I am not opposed to having such a functionality, but this needs to be opt-in, since it is two steps combined into one. There are ways we could make this an opt-in, but as structured, this patch does not allow that. This is why I suggested to add a separate entry point allowing you, and any other user to take these two things together if that suits them, or just use the one that handles generic ops. An alternative to having a separate entry point would be to introduce an Options struct similar to what Tiling does, where this can be an opt-in. But in my view this needs to be an opt-in.

With respect to the change itself, I know its a small change to adapt to this. You just need to change the control function you are using to early exit if the producer and consumer are not linalg.generics. I am not concerned about the "magnitude of change", but more about this function now combining generalization and fusion into one step without it being an opt-in.

Hope that helps. I am absolutely not trying to lock having this functionality. It is something worth allowing to do, and can provide value for those that want to use it.

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 6, 2025

@MaheshRavishankar

@srcarroll , first up, id suggest you leave IREE out of the discussion. It usually ends up muddling the conversation. I am not going to suggest anything related to what IREE does here (which is something completely different and that we found works best for our stack).

I'm just using it as an example to demonstrate how these changes would affect a downstream user. I don't care about IREE itself. Since you aren't giving concrete reasons for why this would "break the world", I'm offering a concrete example of why it doesn't.

The crux of the difference here is that the current fuseElementwiseOps expects to work on linalg.generic

It's only expected now because that's how it used to be defined. Now the definition will be different. Definitions change all the time and it's not unreasonable to expect users to adapt.

The concern I have is that doing this change is effectively combining two fairly substantial transformations, (1) generalization of named ops, and (2) elementwise op fusion into one

I already explained here why that's wrong. If you disagree, then respond to that directly.

I am not opposed to having such a functionality, but this needs to be opt-in, since it is two steps combined into one. There are ways we could make this an opt-in, but as structured, this patch does not allow that

Yes it does. The fuseElementwiseOps function is already is opt-in on a per op basis since you choose whether you call it or not with some Operations OpOperand. If you don't want to fuse a named op then don't call this function when an OpOperands owner is a named op. It's that simple. And as explained, just define the control function for to opt in to whatever you want with populateElementwiseOpsFusionPatterns. So, yes, with these changes as is you have options and can still retain previous functionality if desired. No separate entry point necessary.

As far as I am concerned, I've shown why your concerns are wrong. I could be missing something, but you aren't helping in understanding what I'm missing. I haven't heard any convincing arguments from you yet. All I've heard are claims backed up by nothing. I think we need input from others. @ftynse @rengolin @Groverkss @javedabsar1 @adam-smnk

Edit: Obviously I welcome input involving any of the changes in this PR. But here I'm requesting input specifically about the question of whether these changes offer enough flexibility as is or if another entry point is indeed required.

@rengolin
Copy link
Member

rengolin commented Nov 6, 2025

Not everyone goes down named ops for all elementwise operations. Most common use case is that you use named ops for compute ops like matmul/convs/etc. and use generic ops for elementwise operations.

I'd avoid assuming user intention/behaviour. While this may be true for some, it's not true for everyone. We need to make decisions that are consistent with the dialect, not its current use. (I'm not disagreeing with you, just making clear that usage is not the only proxy for dialect design).

That being said, since all contributors work on MLIR because they use it downstream, I think it is a common courtesy to minimize extra workload for everyone by staging the change when reasonable. The upstream committer may have to extend extra effort, but they will ultimately save time if everyone else does the same.

That's not a reasonable upstream path. You can ask people to help you, and they can agree to help you, but ultimately, your downstream cost must be paid by your team. Other downstream projects already pay their own costs, and it's unreasonable to ask them to pay yours, too. This is a key assumption of the whole LLVM project since its inception and I don't see it changing now.

It is also a matter of encouraging upstream evolution and maintenance: having to jump through too many hoops because of just "fears" of causing downstream to update some APIs is discouraging upstream contributors, in particular those who aren't necessarily paid to work upstream. Reducing the overhead/friction to improve the codebase is critical, and the willingness to break APIs (with reasons) is part of it.

I would not put it quite like @joker-eph above, but the sentiment stands: forcing people to appease to multiple downstream constraints makes it very hard to contribute to the project. This is highly discouraging and conter-productive.

@rengolin Here's another idea. Maybe to start i could just modify the controlFn for linalg-fuse-elementwise-ops pass specifically to only allow linalg.generic. This would keep the current behavior for now, while also giving people more options with the pattern, so everyone would win. Then we can have more time think about if linalg-fuse-elementwise-ops should change.

@srcarroll Adding the generic patterns would be a low-key change and potentially reasonable solution to the conundrum. Remember, Linalg has non-generic operations where fusion doesn't work / makes no sense, so we need to be careful.

However, my concern still stands: if the generic has reduction/transpose/broadcast patterns, then we need to make sure that the fuser can handle them. The current named ops (add, mul) mandate an element-wise affine map, so they would be absolutely fine, but generic, contract and elementwise do allow for arbitrary affine maps, and that could be an issue.


// -----

func.func @map_matmul(%in1: tensor<8x10xf32>, %in2: tensor<10x12xf32>) -> tensor<8x12xf32> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rengolin Per your request I haver added this and the test below. Please let me know if there's more you would like to see

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! That's the kind of thing we need to be testing.

If the matmul has a transpose/broadcast/reduction map on %exp then it shouldn't be fused.

This also applies to contract and elementwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the matmul has a transpose/broadcast/reduction map on %exp then it shouldn't be fused.

Ah yes I forgot matmul was extended to allow different indexing maps. will add more cases involving this. Will also check with contract

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the matmul has a transpose/broadcast/reduction map on %exp then it shouldn't be fused.

@rengolin Actually, why not? I think this is only conditionally true. One case I'm thinking of that involves transpose is valid for fusion, for example

func.func @map_matmul_transpose_a(%in1: tensor<10x8xf32>, %in2: tensor<10x12xf32>) -> tensor<8x12xf32> {
    %fill0 = tensor.empty() : tensor<10x8xf32>
    %exp = linalg.map {math.exp} ins(%in1 : tensor<10x8xf32>) outs(%fill0: tensor<10x8xf32>)
    %fill1 = tensor.empty() : tensor<8x12xf32>
    %matmul = linalg.matmul indexing_maps = [
                       affine_map<(d0, d1, d2) -> (d2, d0)>,
                       affine_map<(d0, d1, d2) -> (d2, d1)>,
                       affine_map<(d0, d1, d2) -> (d0, d1)>
                     ] ins(%exp, %in2 : tensor<10x8xf32>, tensor<10x12xf32>) outs(%fill1 : tensor<8x12xf32>) -> tensor<8x12xf32>
    return %matmul : tensor<8x12xf32>
}

would fuse to

#map = affine_map<(d0, d1, d2) -> (d2, d0)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @map_matmul_transpose_a(%arg0: tensor<10x8xf32>, %arg1: tensor<10x12xf32>) -> tensor<8x12xf32> {
    %0 = tensor.empty() : tensor<8x12xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<10x8xf32>, tensor<10x12xf32>) outs(%0 : tensor<8x12xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %2 = math.exp %in : f32
      %3 = arith.mulf %2, %in_0 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<8x12xf32>
    return %1 : tensor<8x12xf32>
  }
}

broadcast cases can also be valid, for example

func.func @map_matmul_bcast(%in1: tensor<10xf32>, %in2: tensor<10x12xf32>) -> tensor<8x12xf32> {
  %fill0 = tensor.empty() : tensor<10xf32>
  %exp = linalg.map {math.exp} ins(%in1 : tensor<10xf32>) outs(%fill0: tensor<10xf32>)
  %fill1 = tensor.empty() : tensor<8x12xf32>
  %matmul = linalg.matmul indexing_maps = [
                     affine_map<(d0, d1, d2) -> (d2)>,
                     affine_map<(d0, d1, d2) -> (d2, d1)>,
                     affine_map<(d0, d1, d2) -> (d0, d1)>
                   ] ins(%exp, %in2 : tensor<10xf32>, tensor<10x12xf32>) outs(%fill1 : tensor<8x12xf32>) -> tensor<8x12xf32>
  return %matmul : tensor<8x12xf32>
}

fuses to

#map = affine_map<(d0, d1, d2) -> (d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @map_matmul_bcast(%arg0: tensor<10xf32>, %arg1: tensor<10x12xf32>) -> tensor<8x12xf32> {
    %0 = tensor.empty() : tensor<8x12xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<10xf32>, tensor<10x12xf32>) outs(%0 : tensor<8x12xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %2 = math.exp %in : f32
      %3 = arith.mulf %2, %in_0 : f32
      %4 = arith.addf %out, %3 : f32
      linalg.yield %4 : f32
    } -> tensor<8x12xf32>
    return %1 : tensor<8x12xf32>
  }
}

I'll still need to think about cases, with valid input IR, that should NOT result in fusion for the elementwise + matmul case, but I think I will need a more complex case than this to show that. But just wanted to make sure it is agreed that the above cases are valid fusions cases. And again, if you have a specific test case in mind that I'm not thinking of, I will certainly investigate/add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above were produced with the builtin linalg-fuse-elementwise-ops pass

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I should have said "may not be fused" instead.

Copy link
Contributor Author

@srcarroll srcarroll Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool thanks.

And again, if you have a specific test case in mind that I'm not thinking of, I will certainly investigate/add it.

btw, it is not my intention to burden you with coming up with test cases for these changes. I completely accept that responsibility myself. so i just mean if you already have something in mind, please share. so this is more about hoping to get help than expecting it. i will continue adding more cases that i think are illustrative enough of the scope of changes here

@srcarroll
Copy link
Contributor Author

@srcarroll Adding the generic patterns would be a low-key change and potentially reasonable solution to the conundrum. Remember, Linalg has non-generic operations where fusion doesn't work / makes no sense, so we need to be careful.

@rengolin I agree. The areElementwiseOpsFusable function should be used to check if an OpOperand's owner is fusable with its producer. It's used in the FuseElementwiseOps pattern to decide if the pattern should apply and is used in an assert in fuseElementwiseOps (which I don't like this abuse of assert, so I will change it to return a proper Failure). Happy to investigate further to make sure nothing funky happens.

@ftynse
Copy link
Member

ftynse commented Nov 6, 2025

As far as I am concerned, I've shown why your concerns are wrong. I could be missing something, but you aren't helping in understanding what I'm missing. I haven't heard any convincing arguments from you yet. All I've heard are claims backed up by nothing. I think we need input from others. @ftynse @rengolin @Groverkss @javedabsar1 @adam-smnk

IMO, we are getting really passionate about... a function name. Can we just give it a name that makes everyone only slightly unhappy about it and move on?

That's not a reasonable upstream path. You can ask people to help you, and they can agree to help you, but ultimately, your downstream cost must be paid by your team. Other downstream projects already pay their own costs, and it's unreasonable to ask them to pay yours, too. This is a key assumption of the whole LLVM project since its inception and I don't see it changing now.

I think you may be missing the point. I help people in this community, at this point without it being part of my job, and expect nothing in return. Most of the time, these people help me in return eventually, often without me asking. This is the community I'd like to be in. If everybody is only there for their own self-interest, I suspect we get endless irreconcilable arguing over minor details.

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 6, 2025

IMO, we are getting really passionate about... a function name. Can we just give it a name that makes everyone only slightly unhappy about it and move on?

@ftynse I'm not quite sure what you mean. It's not just about a function name. The original suggestion I'm objecting to was to add a templated function. here is the original suggestion that's the source of the entire argument.

The best way forward might be to add a new method fuseNamedOpsAsElementwiseOps and make the core implementation templatized. This way the original method stays as is and only works on generic ops. And the new methods work on named ops + generic ops.

I've thought about it, and I dont think it makes sense to add a whole new function, whether the old fuseElementwiseOps would be a trivial wrapper around fuseNamedOpsAsElementwiseOps or not. I've explained to death why I think it's unnecessary and no one has given me a legitimate counter argument.

Maybe I just misunderstood you, but I was under the impression that you agreed with this sentiment, which I interpreted from your statement below:

I am also rather reluctant to bloat upstream API with similar-but-different calls.

I'm happy to change any function name to whatever anyone wants. I don't care about names. I care about avoiding unnecessary code. I've gone through a lot of effort to explain why an extra function ultimately provides no utility, and no one has put in any effort to explain why it does. If this is really going to be a blocker, then fine, I'll just appease people and move on with life

@srcarroll
Copy link
Contributor Author

If this is really going to be a blocker, then fine, I'll just appease people and move on with life

But I think this needs to be addressed by the other reviewers. I'm not just going to make useless changes to appease one person.

@Groverkss
Copy link
Member

Groverkss commented Nov 6, 2025

I can give an example of why there are concerns over this being a behavior change for the API. I don't want to be too involved in this discussion, so please prevent nit-picking and just think of why the newly introduced underlying behavior can be weird and confusing to a user because they didnt know what they were opting into.

Earlier in the year, I worked on a named operation linalg.exp_reduction which is sugar over a linalg.generic with multiple reduction combiners, with the first reduction combiner having a special property that allows us to vectorize the generic (reduction over e^(max(x) - x) can be vectorized, while it's not true for 2 reduction combiners in general). I represented this as a linalg named operation, because it really is a sugared linalg.generic with some properties i find hard to write a matcher for.

Here is an example of how this looks like : https://github.com/iree-org/iree/blob/shared/exp_reduction_attention/attention_is_two_matmuls/exp_reduction_f32.mlir#L82 (There is more property definitions on that branch, i prototyped it in IREE, but it could go upstream aswell, just didn't want to add something upstream while i was prototyping).

If this operation is converted to a generic, it is very hard to write a matcher to convert it back, because it's relying on a propertly of e^(max(x) - x) especially over floating points, which is even scarier to match. I would never want any API to generalize it without me asking it to do that, because it is dropping very hard to analyze information. This is what is scary about a method that only worked on generics, now working on named ops, but treating them like generics.

Earlier, if i wanted to use the same API over this exp_reduction op (or any other named op that is very hard to analyze and recover), I would have first deliberatly threw away this information by generalizing it and then use fusion on them. But after this change, the API would throw away my information when I never asked it to! So this transformation went from not ever throwing information to now throwing information away.

This is the scary bit. I never asked my named op to be treated as a generic, it is a named op for good reasons and I want it to be treated as so unless i asked it to be treated like a generic.

For my op, I would actually implement a different kind of elementwise fusion, which still results in the named op and doesn't drop information, because i know how to exactly hold it right, and would have just not opted in using this method for it explicitly.

Now you could claim that we should just not call this on that op, which would be true, but I'm just trying to give you an example on why @MaheshRavishankar is concerned about this. This transformation went from being something that could never drop information to something that can now possible drop information, and that is really the concern here. If there was a way to explicitly say "Yeah go ahead and throw that information", it would be ideal, because I'm explicitly opting in to do so. 2 years down the line, I would be questioning why is a method dropping information when i never gave it permission to do so.

I don't have an opinion here, I know how to hold these things right and just add an if condition but I'm just trying to give you an idea on why people are concerned. In the end, it's just a small change and if it helps people in future being explicit about dropping something, how bad could it be?

@srcarroll
Copy link
Contributor Author

I can give an example of why there are concerns over this being a behavior change for the API. I don't want to be too involved in this discussion, so please prevent nit-picking and just think of why the newly introduced underlying behavior can be weird and confusing to a user because they didnt know what they were opting into.

Earlier in the year, I worked on a named operation linalg.exp_reduction which is sugar over a linalg.generic with multiple reduction combiners, with the first reduction combiner having a special property that allows us to vectorize the generic (reduction over e^(max(x) - x) can be vectorized, while it's not true for 2 reduction combiners in general). I represented this as a linalg named operation, because it really is a sugared linalg.generic with some properties i find hard to write a matcher for.

Here is an example of how this looks like : https://github.com/iree-org/iree/blob/shared/exp_reduction_attention/attention_is_two_matmuls/exp_reduction_f32.mlir#L82 (There is more property definitions on that branch, i prototyped it in IREE, but it could go upstream aswell, just didn't want to add something upstream while i was prototyping).

If this operation is converted to a generic, it is very hard to write a matcher to convert it back, because it's relying on a propertly of e^(max(x) - x) especially over floating points, which is even scarier to match. I would never want any API to generalize it without me asking it to do that, because it is dropping very hard to analyze information. This is what is scary about a method that only worked on generics, now working on named ops, but treating them like generics.

Earlier, if i wanted to use the same API over this exp_reduction op (or any other named op that is very hard to analyze and recover), I would have first deliberatly threw away this information by generalizing it and then use fusion on them. But after this change, the API would throw away my information when I never asked it to! So this transformation went from not ever throwing information to now throwing information away.

This is the scary bit. I never asked my named op to be treated as a generic, it is a named op for good reasons and I want it to be treated as so unless i asked it to be treated like a generic.

For my op, I would actually implement a different kind of elementwise fusion, which still results in the named op and doesn't drop information, because i know how to exactly hold it right, and would have just not opted in using this method for it explicitly.

Now you could claim that we should just not call this on that op, which would be true, but I'm just trying to give you an example on why @MaheshRavishankar is concerned about this. This transformation went from being something that could never drop information to something that can now possible drop information, and that is really the concern here. If there was a way to explicitly say "Yeah go ahead and throw that information", it would be ideal, because I'm explicitly opting in to do so. 2 years down the line, I would be questioning why is a method dropping information when i never gave it permission to do so.

I don't have an opinion here, I know how to hold these things right and just add an if condition but I'm just trying to give you an idea on why people are concerned. In the end, it's just a small change and if it helps people in future being explicit about dropping something, how bad could it be?

@Groverkss Thanks for providing more context and giving some examples. I appreciate the constructive feedback.

Just one thing. This started as just a concern, but then evolved to incorrect claims about these changes not giving the option to retain previous behavior. So it's more the latter I've been objecting to lately.

I still stand behind my argument for why the option is there, albeit requiring minor changes in downstream. However, hearing actual reasons for the initial concern from you does help me understand the importance of providing a smoother transition for people.

@srcarroll
Copy link
Contributor Author

srcarroll commented Nov 7, 2025

@MaheshRavishankar @ftynse a3bba1a . I'm not planning on using fuseElementwiseGenericOps in builtin passes. This is purely for downstream users, per your request, to retain previous functionality. Since fuseElementwiseGenericOps is effectively a subset of fuseElementwiseLinalgOps, it should suffice to test with the latter.

I've noted that fuseElementwiseGenericOps should be deprecated in favor of fuseElementwiseLinalgOps, because I still firmly believe there is no long term utility to having both. I'm making a compromise purely to make the transition smoother for downstream users. I hope this will alleviate your concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants